2 Loading Data
Read the dataset (e.g., u.data, u.item, u.user) into R:
ratings <- read.table("data/ml-100k/u.data", sep = "\t",
                      col.names = c("user_id", "item_id", "rating", "timestamp"))
users <- read.table("data/ml-100k/u.user", sep = "|",
                    col.names = c("user_id", "age", "gender", "occupation", "zip_code"))
movies <- read.table("data/ml-100k/u.item", sep = "|",
                     col.names = c("item_id", "movie_title", "release_date", "video_release_date",
                                   "IMDb_URL", "unknown", "Action", "Adventure", "Animation", "Children",
                                   "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "Film_Noir",
                                   "Horror", "Musical", "Mystery", "Romance", "Sci_Fi", "Thriller",
                                   "War", "Western"),
                     fill = TRUE, encoding = "latin1", quote = "")
The next step is to merge the loaded data into a single dataframe,
using the R merge function:
merged_df <- merge(ratings, users, by = "user_id", all.x = TRUE)
merged_df <- merge(merged_df, movies, by = "item_id", all.x = TRUE)
str(merged_df)
'data.frame': 100000 obs. of 31 variables:
$ item_id : int 1 1 1 1 1 1 1 1 1 1 ...
$ user_id : int 1 117 429 919 457 468 17 892 16 580 ...
$ rating : int 5 4 3 4 4 5 4 5 5 3 ...
$ timestamp : int 874965758 880126083 882385785 875289321 882393244 875280395 885272579 886608185 877717833 884125243 ...
$ age : int 24 20 27 25 33 28 30 36 21 16 ...
$ gender : chr "M" "M" "M" "M" ...
$ occupation : chr "technician" "student" "student" "other" ...
$ zip_code : chr "85711" "16125" "29205" "14216" ...
$ movie_title : chr "Toy Story (1995)" "Toy Story (1995)" "Toy Story (1995)" "Toy Story (1995)" ...
$ release_date : chr "01-Jan-1995" "01-Jan-1995" "01-Jan-1995" "01-Jan-1995" ...
$ video_release_date: logi NA NA NA NA NA NA ...
$ IMDb_URL : chr "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)" "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)" "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)" "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)" ...
$ unknown : int 0 0 0 0 0 0 0 0 0 0 ...
$ Action : int 0 0 0 0 0 0 0 0 0 0 ...
$ Adventure : int 0 0 0 0 0 0 0 0 0 0 ...
$ Animation : int 1 1 1 1 1 1 1 1 1 1 ...
$ Children : int 1 1 1 1 1 1 1 1 1 1 ...
$ Comedy : int 1 1 1 1 1 1 1 1 1 1 ...
$ Crime : int 0 0 0 0 0 0 0 0 0 0 ...
$ Documentary : int 0 0 0 0 0 0 0 0 0 0 ...
$ Drama : int 0 0 0 0 0 0 0 0 0 0 ...
$ Fantasy : int 0 0 0 0 0 0 0 0 0 0 ...
$ Film_Noir : int 0 0 0 0 0 0 0 0 0 0 ...
$ Horror : int 0 0 0 0 0 0 0 0 0 0 ...
$ Musical : int 0 0 0 0 0 0 0 0 0 0 ...
$ Mystery : int 0 0 0 0 0 0 0 0 0 0 ...
$ Romance : int 0 0 0 0 0 0 0 0 0 0 ...
$ Sci_Fi : int 0 0 0 0 0 0 0 0 0 0 ...
$ Thriller : int 0 0 0 0 0 0 0 0 0 0 ...
$ War : int 0 0 0 0 0 0 0 0 0 0 ...
$ Western : int 0 0 0 0 0 0 0 0 0 0 ...
Let’s inspect the data in order to understand its structure. View the
first 500 rows:
library(DT)
datatable(head(merged_df, 500),
options = list(scrollY = "400px", scrollX = TRUE),
rownames = FALSE)
The dataframe consists of 100,000 observations of 31 variables, with
the following data types/classes:
data_types <- sapply(merged_df, class)
data_type_count <- table(data_types)
print(data_type_count)
data_types
character integer logical
6 24 1
4 PCA Analysis
scaled_features <- scale(df_final) # Standardize features
# center/scale. are redundant here (features are already standardized) but harmless
pca_result <- prcomp(scaled_features, center = TRUE, scale. = TRUE)
summary(pca_result) # View variance explained
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.4994 1.40233 1.29543 1.18026 1.08093 1.05792 1.04389
Proportion of Variance 0.1022 0.08939 0.07628 0.06332 0.05311 0.05087 0.04953
Cumulative Proportion 0.1022 0.19158 0.26786 0.33118 0.38429 0.43516 0.48469
PC8 PC9 PC10 PC11 PC12 PC13 PC14
Standard deviation 1.02498 1.00372 0.99951 0.98318 0.9693 0.94294 0.94025
Proportion of Variance 0.04775 0.04579 0.04541 0.04394 0.0427 0.04042 0.04019
Cumulative Proportion 0.53244 0.57824 0.62365 0.66759 0.7103 0.75071 0.79089
PC15 PC16 PC17 PC18 PC19 PC20 PC21
Standard deviation 0.90321 0.86802 0.85308 0.79792 0.77282 0.65908 0.60505
Proportion of Variance 0.03708 0.03425 0.03308 0.02894 0.02715 0.01974 0.01664
Cumulative Proportion 0.82797 0.86222 0.89530 0.92424 0.95139 0.97113 0.98777
PC22
Standard deviation 0.51869
Proportion of Variance 0.01223
Cumulative Proportion 1.00000
# Plot the first two principal components to visualize the results
biplot(pca_result, scale = 0)

Since we are doing PCA to reduce dimensionality before clustering, a
3D visualization of the first three principal components (PC1, PC2, PC3)
colored by rating will help us understand how ratings are distributed in
the transformed feature space. (in the HTML version of the
report the 3D plots are interactive)
library(rgl)
pca_data <- data.frame(pca_result$x[, 1:3]) # Extract first 3 PCs
pca_data$rating <- as.factor(df_final$rating) # Convert rating to factor for coloring
# Define colors based on rating
rating_levels <- sort(unique(df_final$rating)) # Ensure sorted order
rating_colors <- rainbow(length(rating_levels)) # Generate distinct colors
colors <- rating_colors[as.numeric(pca_data$rating)] # Assign colors to points
# Open an interactive 3D plot with rgl
open3d()
plot3d(pca_data$PC1, pca_data$PC2, pca_data$PC3, col = colors, size = 2,
xlab = "PC1", ylab = "PC2", zlab = "PC3",
main = "PCA: Rating Representation in PC1, PC2, PC3")
legend3d("topright", legend = rating_levels, pch = 19, col = rating_colors, title = "Ratings")
rglwidget()
Deciding How Many PCA Components to Retain
We will consider two different approaches [2,6,11,19]:
- The Kaiser Criterion: retaining only the components whose
eigenvalues exceed 1.
- Cumulative Variance Explained: retaining
components that explain a sufficient percentage (e.g., 90%) of the total
variance, regardless of the eigenvalue threshold.
4.1 The Kaiser Criterion
The Kaiser Criterion is a rule of thumb used in Principal Component
Analysis (PCA) for determining the number of principal components (PCs)
to retain [11,23].
It is based on the idea that we should only keep the components that
explain a significant amount of variance in the data.
The Kaiser Criterion suggests retaining only those principal
components whose eigenvalues are greater than 1 [6,11].
Eigenvalues represent the amount of variance captured by each
principal component.
A component with an eigenvalue greater than 1 is considered to
explain more variance than a single original variable (which has an
eigenvalue of 1), and thus it is deemed useful to retain.
If a component has an eigenvalue less than 1, it means it is
explaining less variance than one of the original variables, and
therefore might not be worth retaining.
# Get the eigenvalues (sdev squared)
eigenvalues <- (pca_result$sdev)^2
# Apply the Kaiser Criterion: Retain components with eigenvalue > 1
components_to_retain <- which(eigenvalues > 1)
# Number of components to retain
num_components <- length(components_to_retain)
# Print the components to retain and their eigenvalues
cat("Number of components to retain:", num_components, "\n")
Number of components to retain: 9
k <- num_components # Kaiser criterion
pca_reduced <- as.data.frame(pca_result$x[, 1:k]) # Keep first 'k' principal components
# Check dimensions
dim(pca_reduced) # Should be (num_samples, k)
[1] 20000 9
4.2 Threshold Approach (90% Variance Explained)
As an alternative, we can select the smallest number of PCs that
captures at least 90% of the total variance.
# Compute the proportion of variance explained by each principal component
variance_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
# Compute the cumulative variance
cumulative_variance <- cumsum(variance_explained)
# Find the number of components that explain at least 90% of the variance
num_components_90 <- which(cumulative_variance >= 0.90)[1]
# Print results
cat("Number of components needed to explain at least 90% variance:", num_components_90, "\n")
Number of components needed to explain at least 90% variance: 18
The Kaiser Criterion suggests retaining 9 components
(with eigenvalues greater than 1).
This method helps eliminate noise and retain only significant PCs
while avoiding redundant or less informative dimensions.
This way, by selecting 9 principal components (PCs) based on the Kaiser
Criterion instead of the 18 PCs needed to cover 90% cumulative variance,
we seek a balance between dimensionality reduction, computational
efficiency, and meaningful variance retention for clustering.
5 Clustering on PCA-Reduced Data (Kaiser Criterion)
In this case, retaining 9 PCs ensures that only the most informative,
high-variance dimensions are used for clustering.
The second reason is computational efficiency for the
hierarchical clustering that we intend to do later. Hierarchical
clustering is computationally expensive (O(n²) or worse complexity),
making it impractical with high-dimensional data [2,3].
Reducing from 22 to 9 PCs significantly reduces the computational
load while preserving a high level of variance (though not exactly 90%).
A lower-dimensional space (9D instead of 22D) results in faster distance
calculations and lower memory usage.
We are also reducing redundant information, since the
first few PCs often capture the most important structure of the data,
while later PCs mostly capture small variations or noise. The Kaiser
Criterion helps avoid overfitting by excluding components that
contribute little meaningful structure for clustering.
Thus, applying clustering on 9 PCA components is a better trade-off
between information retention and computational feasibility, making
hierarchical clustering more efficient and scalable.
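The computational argument can be made concrete with a back-of-the-envelope sketch (pure base R; the 20,000-observation figure is the sample size used later in this report). Note that the stored distance matrix depends only on the number of observations n; reducing from 22 to 9 dimensions mainly speeds up the computation of each individual distance.

```r
# Back-of-the-envelope memory cost of hierarchical clustering:
# dist() stores n * (n - 1) / 2 double-precision (8-byte) distances.
n <- 20000                        # observations in pca_reduced
n_pairs <- n * (n - 1) / 2        # number of pairwise distances
size_gb <- n_pairs * 8 / 1024^3   # approximate size in gigabytes
round(size_gb, 2)                 # about 1.49 GB for one distance matrix
```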
Now that we have reduced dimensionality with PCA, let's perform
clustering using K-Means, Hierarchical Clustering, DBSCAN, and GMM.
5.1 K-Means Clustering
K-Means clusters similar users or movies based on PCA-extracted
features.
Determining the Optimal Number of Clusters (Elbow Method)
To identify the optimal number of clusters (K), we will use the Elbow
Method. The Elbow Method is a technique used to determine the optimal
number of clusters (K) in K-means clustering by analyzing the
Within-Cluster Sum of Squares (WCSS) or total inertia. It helps balance
cluster compactness and computational efficiency [11,15,27].
The Total Within-Cluster Sum of Squares (WSS) is a key measure used
to evaluate the quality of a clustering solution, specifically in the
context of K-means clustering [11,17].
In K-means clustering, the goal is to partition a dataset into \(k\) clusters, where each point in the
dataset is assigned to the cluster whose centroid is closest to the
point [17].
The Total Within-Cluster Sum of Squares measures how compact the
clusters are, and it quantifies the variability of the points within
each cluster. It is computed as the sum of squared distances between
each data point and the centroid of its respective cluster [17,27].
\[
\text{WSS} = \sum_{i=1}^{k} \sum_{x \in C_i} (x - \mu_i)^2
\]
Where: - \(\text{WSS}\) is the Total
Within-Cluster Sum of Squares.
\(k\) is the number of
clusters.
\(C_i\) represents the set of
data points assigned to the \(i^{th}\)
cluster.
\(x\) is a data point.
\(\mu_i\) is the centroid (mean)
of the \(i^{th}\) cluster.
A small total within-cluster sum of squares means that the points
within each cluster are close to the centroid, suggesting that the
clustering solution is tight and well-defined. The clusters are compact.
Whereas, a larger WSS indicates that the points within each cluster are
more spread out, suggesting that the clustering solution is less
compact, and the clusters may not be well-separated or distinct
[5,17].
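As a small sketch (toy random data, not the MovieLens sample), the WSS formula above can be verified directly against what kmeans() reports as tot.withinss:

```r
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)          # 20 random 2-D points
km <- kmeans(x, centers = 2, nstart = 10)
# WSS by definition: squared distance of each point to its assigned centroid
wss_manual <- sum((x - km$centers[km$cluster, ])^2)
all.equal(wss_manual, km$tot.withinss)    # TRUE
```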
In the Elbow Method for determining the optimal number of clusters,
we plot the WSS for different values of \(k\) (the number of clusters).
As the number of clusters increases, the WSS typically decreases
because having more clusters will lead to each cluster containing fewer
data points, making the points closer to the centroid. The “elbow” point
on the graph, where the rate of decrease in WSS slows down, indicates
the optimal number of clusters [17].
This is the point where adding more clusters doesn’t significantly
improve the quality of the clustering (i.e., the reduction in WSS is
marginal).
library(ggplot2)
# Use the Elbow Method to find the optimal number of clusters
wss <- numeric(20)  # Store within-cluster sum of squares for 1 to 20 clusters
for (i in 1:20) {
  kmeans_temp <- kmeans(pca_reduced, centers = i, nstart = 25)
  wss[i] <- kmeans_temp$tot.withinss
}
# Create a data frame to use for ggplot
elbow_data <- data.frame(Clusters = 1:20, WSS = wss)
# Create the Elbow plot using ggplot2 with integer x-axis labels
ggplot(elbow_data, aes(x = Clusters, y = WSS)) +
  geom_line() +
  geom_point() +
  ggtitle("Elbow Method for Optimal Number of Clusters") +
  xlab("Number of Clusters") +
  ylab("Total Within-Cluster Sum of Squares") +
  scale_x_continuous(breaks = 1:20) +  # Set x-axis labels as integers
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

As we look for a compromise between number of clusters and
computational load, we will choose k = 9 from the Elbow Method:
set.seed(123) # For reproducibility
k_optimal <- 9
kmeans_model <- kmeans(pca_reduced, centers = k_optimal, nstart = 25)
# Add cluster labels to the dataset
pca_reduced$cluster <- as.factor(kmeans_model$cluster)
# View cluster distribution
table(pca_reduced$cluster)
1 2 3 4 5 6 7 8 9
707 1060 3649 5174 245 6492 1466 1046 161
# Plot clusters in 3D using rgl
open3d()
# Define colors based on cluster labels
colors <- rainbow(k_optimal)[as.numeric(pca_reduced$cluster)]
# Plot the 3D scatter plot
plot3d(pca_reduced$PC1, pca_reduced$PC2, pca_reduced$PC3,
col = colors, size = 2,
xlab = "PC1", ylab = "PC2", zlab = "PC3",
main = "3D PCA Plot with K-Means Clusters")
#type="s")
# Add legend for clusters
legend3d("topright", legend = levels(pca_reduced$cluster),
col = rainbow(k_optimal), pch = 19,
title = "Clusters", cex = 1)
rglwidget()
Cluster Profiling
Cluster profiling is the process of analyzing and interpreting the
characteristics of different clusters after performing clustering (e.g.,
K-Means) [4,12,27].
The goal is to understand what defines each cluster in terms of
meaningful patterns in the data, in order to:
- Identify dominant features of each cluster.
- Gain insights into user or item segmentation (e.g., different
types of customers or movie genres).
- Support decision-making by understanding how groups differ from
one another.
When we’re using PCA-reduced data for clustering, the clusters are in
the lower-dimensional PCA space. However, after clustering, we can still
interpret the clusters in terms of the original features by computing
the means of those original features for each cluster [4].
The clustering process in the PCA space captures the structure of the
data, but the cluster profiles we generate still relate to the original
features [4,11].
cluster_profile <- aggregate(df_final, by = list(cluster = pca_reduced$cluster), FUN = mean)
knitr::kable(cluster_profile, caption = "Cluster Profile - Mean Values by Cluster") %>%
kableExtra::kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "400px")
Cluster Profile - Mean Values by Cluster

| cluster | rating | age | gender | unknown | Action | Adventure | Animation | Children | Comedy | Crime | Documentary | Drama | Fantasy | Film_Noir | Horror | Musical | Mystery | Romance | Sci_Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3.584158 | 31.23621 | 0.6760962 | 0.000000 | 0.0113154 | 0.1060820 | 0.8203678 | 0.9844413 | 0.3309760 | 0.0000000 | 0 | 0.0579915 | 0 | 0.000000 | 0.0000000 | 0.7001414 | 0.0000000 | 0.0198020 | 0.0141443 | 0.0113154 | 0.0113154 | 0.0000000 |
| 2 | 3.814151 | 36.69057 | 0.7726415 | 0.000000 | 0.0933962 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0481132 | 0.2783019 | 0 | 0.1990566 | 0 | 0.345283 | 0.0405660 | 0.0000000 | 0.7811321 | 0.0830189 | 0.1254717 | 0.7028302 | 0.0000000 | 0.0000000 |
| 3 | 3.334338 | 30.81968 | 0.7966566 | 0.000000 | 0.7344478 | 0.3143327 | 0.0030145 | 0.0317895 | 0.0545355 | 0.1145519 | 0 | 0.0887914 | 0 | 0.000000 | 0.0000000 | 0.0057550 | 0.0205536 | 0.0838586 | 0.1296246 | 0.5631680 | 0.0071252 | 0.0972869 |
| 4 | 3.363742 | 32.87167 | 0.7145342 | 0.000000 | 0.0458060 | 0.0402010 | 0.0189409 | 0.0786625 | 0.9460765 | 0.0237727 | 0 | 0.1246618 | 0 | 0.000000 | 0.0000000 | 0.0761500 | 0.0036722 | 0.3156165 | 0.0262853 | 0.0303440 | 0.0374952 | 0.0001933 |
| 5 | 3.204082 | 33.20408 | 0.7551020 | 0.000000 | 0.1877551 | 0.3918367 | 0.0938776 | 0.5755102 | 0.4489796 | 0.0897959 | 0 | 0.2612245 | 1 | 0.000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.1714286 | 0.4530612 | 0.0326531 | 0.0000000 | 0.0000000 |
| 6 | 3.741066 | 34.14171 | 0.7296673 | 0.000000 | 0.0811768 | 0.0044670 | 0.0000000 | 0.0123229 | 0.0217190 | 0.1033580 | 0 | 0.9882933 | 0 | 0.000000 | 0.0000000 | 0.0174060 | 0.0121688 | 0.2202711 | 0.0405114 | 0.1424831 | 0.1464880 | 0.0012323 |
| 7 | 3.743520 | 32.37858 | 0.7974079 | 0.000000 | 0.9072306 | 0.7980900 | 0.0075034 | 0.0027285 | 0.1214188 | 0.0000000 | 0 | 0.0559345 | 0 | 0.000000 | 0.0088677 | 0.0000000 | 0.0081855 | 0.3015007 | 0.8594816 | 0.1446112 | 0.4965894 | 0.0000000 |
| 8 | 3.248566 | 30.93403 | 0.7829828 | 0.000956 | 0.2313576 | 0.0411090 | 0.0095602 | 0.0000000 | 0.1500956 | 0.0277247 | 0 | 0.0717017 | 0 | 0.000000 | 0.9990440 | 0.0000000 | 0.0219885 | 0.0736138 | 0.1720841 | 0.3078394 | 0.0000000 | 0.0000000 |
| 9 | 3.695652 | 34.25466 | 0.7701863 | 0.000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 1 | 0.0621118 | 0 | 0.000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0869565 | 0.0000000 |
Based on these results, we can now do cluster profiling and
interpretation.
This cluster interpretation gives a detailed view of how each group
behaves and their genre preferences based on their movie ratings and
demographic characteristics.
Cluster Interpretation:
Cluster: The cluster label (1 through 9) for each group.
Rating: The average rating of the users in that cluster.
Age: The average age of the users in that cluster.
Gender: The average gender proportion (presumably 1 = male, 0 = female).
Unknown: Proportion of rated movies with an unknown genre.
Action, Adventure, Animation, etc.: Proportion of rated movies belonging to each genre.
Cluster 1:
Rating: 3.58 (moderate rating)
Age: 31.24 (middle-aged users)
Gender: 0.68 (68% male users)
Action: 1.13% (low interest in Action movies)
Adventure: 10.61% (moderate interest in Adventure movies)
Animation: 82.04% (high interest in Animation movies)
Other Genres: High interest in Children (98.44%) and Musical (70.01%) movies.
Interpretation: This cluster represents middle-aged male users who
are particularly interested in Animation, Children, and Musical movies,
with relatively low interest in Action and Adventure genres.
Cluster 2:
Rating: 3.81 (moderately high rating)
Age: 36.69 (older users)
Gender: 0.77 (77% male users)
Action: 9.34% (low interest in Action movies)
Crime: 27.83% (moderate interest in Crime movies)
Film Noir: 34.53% (moderate interest in Film Noir movies)
Other Genres: High interest in Mystery (78.11%) and Thriller (70.28%).
Interpretation: Cluster 2 represents older male users who have a
strong preference for Mystery and Thriller movies, with moderate interest
in Crime and Film Noir.
Cluster 3:
Rating: 3.33 (moderate rating)
Age: 30.82 (young adult users)
Gender: 0.80 (80% male users)
Action: 73.44% (very high interest in Action movies)
Adventure: 31.43% (moderate interest in Adventure movies)
Drama: 8.88% (minimal interest in Drama)
Other Genres: Moderate interest in Thriller (56.32%) and Crime (11.46%).
Interpretation: Cluster 3 represents younger male users who have a
strong interest in Action and Adventure movies, with a more moderate
interest in Thriller and Crime genres.
Cluster 4:
Rating: 3.36 (moderate rating)
Age: 32.87 (middle-aged users)
Gender: 0.71 (71% male users)
Action: 4.58% (low interest in Action movies)
Adventure: 4.02% (low interest in Adventure movies)
Comedy: 94.61% (very high interest in Comedy movies)
Other Genres: Moderate interest in Romance (31.56%) and Drama (12.47%).
Interpretation: Cluster 4 represents middle-aged male users who have
a very high interest in Comedy movies, with moderate interest in Romance
and low interest in other genres such as Action and Adventure.
Cluster 5:
Rating: 3.20 (moderate rating)
Age: 33.20 (middle-aged users)
Gender: 0.76 (76% male users)
Fantasy: 100% (very high interest in Fantasy movies)
Adventure: 39.18% (moderate interest in Adventure movies)
Children: 57.55% (moderate interest in Children movies)
Other Genres: Moderate interest in Sci-Fi (45.31%) and Comedy (44.90%).
Interpretation: Cluster 5 represents middle-aged male users who enjoy
Fantasy, Children, and Adventure movies, with moderate interest in
Sci-Fi and Comedy.
Cluster 6:
Rating: 3.74 (high rating)
Age: 34.14 (middle-aged users)
Gender: 0.73 (73% male users)
Action: 8.12% (low interest in Action movies)
Drama: 98.83% (very high interest in Drama movies)
Other Genres: Moderate interest in Romance (22.03%) and low interest in genres like Sci-Fi and War.
Interpretation: Cluster 6 represents middle-aged male users who have
a strong interest in Drama, with minimal interest in Action and
Sci-Fi.
Cluster 7:
Rating: 3.74 (high rating)
Age: 32.38 (middle-aged users)
Gender: 0.80 (80% male users)
Action: 90.72% (very high interest in Action movies)
Adventure: 79.81% (very high interest in Adventure movies)
Sci-Fi: 85.94% (very high interest in Sci-Fi movies)
Other Genres: Moderate interest in War (49.66%) and Romance (30.15%).
Interpretation: Cluster 7 represents middle-aged male users who are
very interested in Action, Adventure, and Sci-Fi genres.
Cluster 8:
Rating: 3.25 (moderate rating)
Age: 30.93 (young adult users)
Gender: 0.78 (78% male users)
Action: 23.14% (moderate interest in Action movies)
Adventure: 4.11% (low interest in Adventure movies)
Horror: 99.90% (very high interest in Horror movies)
Other Genres: Moderate interest in Thriller (30.78%) and Comedy (15.01%).
Interpretation: Cluster 8 represents younger male users with a very
high interest in Horror movies, along with moderate interest in Action
and Thriller.
Cluster 9:
Rating: 3.70 (moderately high rating)
Age: 34.25 (middle-aged users)
Gender: 0.77 (77% male users)
Action: 0.00% (no interest in Action movies)
Adventure: 0.00% (no interest in Adventure movies)
Documentary: 100% (very high interest in Documentary movies)
Other Genres: Minimal interest in War and other genres.
Interpretation: Cluster 9 represents middle-aged male users with a
strong interest in Documentary movies, with no interest in Action or
Adventure.
Summary:
Cluster 1: Middle-aged male users with a strong preference for Animation and Children movies.
Cluster 2: Older male users with a strong interest in Mystery and Thriller genres.
Cluster 3: Younger male users interested in Action and Thriller movies.
Cluster 4: Middle-aged male users with a strong preference for Comedy.
Cluster 5: Middle-aged male users with a high interest in Fantasy and Children movies.
Cluster 6: Middle-aged male users with a strong preference for Drama.
Cluster 7: Middle-aged male users with a strong preference for Action, Adventure, and Sci-Fi.
Cluster 8: Younger male users with a high interest in Horror.
Cluster 9: Middle-aged male users with a strong interest in Documentary.
This analysis helps identify the preferences and demographics of
users across the different clusters, enabling better content
recommendations or targeted marketing strategies based on user
profiles.
Cluster Validation
To validate the clusters further, we can use the Silhouette
Score, which provides a measure of how well-separated and
well-formed our clusters are [5,11,19].
Silhouette Score
The silhouette score is a measure of how well each
point is clustered based on its cohesion (how close it
is to its own cluster) and separation (how far it is
from the next nearest cluster) [5,11].
For a data point \(i\), the
silhouette score is calculated as:
\[
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
\]
Where:
- \(a(i)\) = Average distance between
point \(i\) and all
other points in the same cluster (cohesion).
- \(b(i)\) = Average distance between
point \(i\) and all
points in the nearest cluster (separation).
- \(s(i)\) ranges from -1 to
1, where:
- Close to 1 : Well-clustered (good separation & cohesion).
- Around 0 : Overlapping clusters.
- Negative : Misclassified point (closer to another cluster).
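To make the formula concrete, here is a tiny hand computation (toy 1-D data, not the PCA sample) of s(i) for a single, well-clustered point:

```r
# Hand computation of s(i) for one point in a toy 1-D example
x  <- c(1, 2, 8, 9)                # four points
cl <- c(1, 1, 2, 2)                # two clusters
# Point i = 1 (x = 1), which belongs to cluster 1:
a_i <- mean(abs(1 - 2))            # cohesion: avg distance within own cluster
b_i <- mean(abs(1 - c(8, 9)))      # separation: avg distance to nearest cluster
s_i <- (b_i - a_i) / max(a_i, b_i)
s_i                                # (7.5 - 1) / 7.5, close to 1: well clustered
```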
A box plot summarizes the distribution of
silhouette scores per cluster:
- Median line → Represents the typical
silhouette score for the cluster.
- Box range → Shows the spread of
silhouette widths within each cluster.
- Outliers → Indicate misclassified
points with low or negative silhouette scores.
Using a ggplot2 boxplot, we can quickly compare clusters,
detect overlapping clusters, and identify poorly defined groups.
library(cluster)
# Exclude the cluster label column before computing distances
sil_score_kmeans <- silhouette(kmeans_model$cluster,
                               dist(pca_reduced[, !names(pca_reduced) %in% "cluster"]))
cat("Average Silhouette Score:", mean(sil_score_kmeans[, 3]))
Average Silhouette Score: 0.3922064
We obtain an average Silhouette score of approximately 0.39, indicating
moderate cluster separation and cohesion.
Some clusters, such as Cluster 9 and Cluster 1, have higher
Silhouette scores.
Clusters 1, 6, 7, and 8 contain some misclassified points with low
Silhouette scores, which contribute to lowering the overall K-Means
average Silhouette score.
library(ggplot2)
library(cluster)
# Convert silhouette object to a data frame
sil_df <- data.frame(
cluster = as.factor(sil_score_kmeans[, 1]),
width = sil_score_kmeans[, 3]
)
# Plot with ggplot2
ggplot(sil_df, aes(x = cluster, y = width, fill = cluster)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Silhouette Widths for K-Means Clustering",
x = "Cluster",
y = "Silhouette Width")

5.2 Hierarchical Clustering
Hierarchical Clustering and Distance Calculation
Hierarchical clustering is an unsupervised machine
learning technique used to group similar data points into
clusters without pre-specifying the number of clusters
(K). Unlike K-means, which assigns points to clusters based on
centroids, hierarchical clustering builds a tree-like structure
(dendrogram) to represent data similarity [1,2,5,15].
Hierarchical clustering operates in two main ways [5,15]:
Agglomerative (Bottom-Up, Most Common) → Starts
with each data point as its own cluster and iteratively
merges the most similar clusters.
Divisive (Top-Down) → Starts with all
points in one cluster and recursively splits into smaller
clusters.
In this work we will follow the first approach.
Distance Calculation:
The dist() function computes pairwise distances
between observations in pca_reduced.
By default, it uses the Euclidean distance, defined
as:
\[
d(A,B) = \sqrt{(A_1 - B_1)^2 + (A_2 - B_2)^2 + \dots + (A_n - B_n)^2}
\]
This distance metric ensures that points with similar PCA-reduced
features are closer together.
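As a quick sanity check of the formula (toy two-point example), dist() agrees with the Euclidean definition above:

```r
# dist() on two points agrees with the Euclidean distance formula
A <- c(1, 2, 2)
B <- c(4, 6, 2)
manual <- sqrt(sum((A - B)^2))   # sqrt(3^2 + 4^2 + 0^2) = 5
as.numeric(dist(rbind(A, B)))    # 5
```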
Other possible distance metrics [5,11]:
- Manhattan Distance (method = "manhattan") → Sum
of absolute differences.
- Minkowski Distance → A generalized form of Euclidean
and Manhattan.
- Cosine Similarity → A metric used to measure how
similar two vectors are, based on the cosine of the angle between them.
It is widely used in text analysis, machine learning, and clustering.
Cosine similarity between two vectors \(A\) and \(B\) is defined as:
\[
\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}
\]
where:
\(A \cdot B\) is the dot product
of the two vectors.
\(\|A\|\) and \(\|B\|\) are the magnitudes (norms) of
vectors \(A\) and \(B\).
\(\theta\) is the angle between
the vectors.
- Focuses on the direction rather than magnitude,
making it useful for text and high-dimensional data.
- Works well when comparing documents of different
lengths.
- Efficient to compute using dot product operations.
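The definition can be implemented directly in base R (a minimal sketch, with no extra packages; a common cosine *distance* is then 1 minus the similarity):

```r
# Cosine similarity from the definition: dot product over product of norms
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

a <- c(1, 2, 3)
b <- c(2, 4, 6)                  # same direction as a, larger magnitude
cosine_sim(a, b)                 # 1: direction matters, magnitude does not
1 - cosine_sim(a, b)             # a common cosine *distance*: 0
cosine_sim(a, c(-1, 0, 1))       # partially aligned vectors
```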
Linkage Method:
Once the distances between points are computed, we need to determine
how to merge clusters at each step. The
hclust() function provides various linkage methods
[4,11,12,16]:
Single Linkage: Merges clusters based on the
shortest pairwise distance between points in different clusters. This
can lead to a chaining effect, where long, loose clusters are
formed.
Complete Linkage: Uses the maximum pairwise
distance between points in different clusters, leading to compact and
tightly bound clusters.
Ward’s Method (ward.D2): Minimizes the increase
in intra-cluster variance, resulting in balanced and compact
clusters.
Average Linkage calculates the average pairwise
distance between all points in two clusters and merges the clusters that
have the smallest average distance [4,11].
Mathematically, for two clusters \(A\) and \(B\), the average linkage distance is
defined as:
\[
d(A,B) = \frac{1}{|A| |B|} \sum_{i \in A} \sum_{j \in B} d(i,j)
\]
where:
- \(d(i,j)\) is the distance between
individual points \(i\) and \(j\).
- \(|A|\) and \(|B|\) are the number of points in clusters
\(A\) and \(B\).
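The formula is easy to verify on a toy example (two small 1-D clusters, made up for illustration):

```r
# Toy check of the average-linkage formula: d(A,B) is the mean of all
# pairwise distances between points in the two clusters
A <- c(0, 1)                        # cluster A (1-D points)
B <- c(4, 6)                        # cluster B
pairwise <- abs(outer(A, B, "-"))   # |0-4|, |0-6|, |1-4|, |1-6|
mean(pairwise)                      # (4 + 6 + 3 + 5) / 4 = 4.5
```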
# Function to compute and plot average silhouette scores for several
# distance metrics, linkage methods, and numbers of clusters
silhouette_analysis <- function(data,
                                distance_metrics = c("euclidean", "manhattan", "cosine"),
                                linkages = c("single", "complete", "average", "ward.D"),
                                k_values = 5:9) {
  # Ensure data is numeric
  data_numeric <- as.matrix(data)
  results <- data.frame()  # Store silhouette results
  # Iterate over distance metrics
  for (dist_metric in distance_metrics) {
    if (dist_metric == "cosine") {
      # Cosine dissimilarity via the proxy package
      distance_matrix <- proxy::dist(data_numeric, method = "cosine")
    } else {
      # Standard distance matrix for the other metrics
      distance_matrix <- dist(data_numeric, method = dist_metric)
    }
    # Iterate over linkage methods
    for (linkage in linkages) {
      # Perform hierarchical clustering
      hc <- hclust(distance_matrix, method = linkage)
      # Iterate over k values (number of clusters)
      for (k in k_values) {
        # Cut the dendrogram to form clusters
        clusters <- cutree(hc, k = k)
        # Compute the average silhouette score
        sil <- silhouette(clusters, distance_matrix)
        avg_sil <- mean(sil[, 3])
        # Store results
        results <- rbind(results, data.frame(
          Distance = dist_metric,
          Linkage = linkage,
          Clusters = k,
          Avg_Silhouette = avg_sil
        ))
      }
    }
  }
  # Plot the results
  ggplot(results, aes(x = Clusters, y = Avg_Silhouette, color = Linkage)) +
    geom_line() + geom_point(size = 2) +
    facet_wrap(~ Distance) +
    labs(title = "Average Silhouette Score for Different Linkage and Distance Metrics",
         x = "Number of Clusters (k)", y = "Average Silhouette Score") +
    theme_minimal()
}
# Exclude the k-means cluster label before analysing
silhouette_analysis(pca_reduced[, !names(pca_reduced) %in% "cluster"])

Although the Euclidean distance seems to yield a better average
silhouette, due to the nature of our data (mainly categorical encoded
variables) we will use the cosine similarity as
distance metric.
From the plot we can also see that the best average silhouette for
this metric is obtained with 9 clusters and "average" linkage.
So let’s assume we want k=9 clusters, distance=“cosine” and
linkage=“average”:
library(dendextend)
pca_numeric <- pca_reduced[, !names(pca_reduced) %in% "cluster"]
# Convert to a numeric matrix
pca_matrix <- as.matrix(pca_numeric)
cosine_sim_matrix <- proxy::dist(pca_matrix, method = "cosine")
# Perform hierarchical clustering using cosine similarity
hclust_result <- hclust(cosine_sim_matrix, method = "average")
# Plot the dendrogram
plot(hclust_result, main = "Dendrogram of Hierarchical Clustering",
xlab = "Observations", ylab = "Height", cex = 0.7)

# Optional: Color branches for better visualization
dend <- as.dendrogram(hclust_result)
dend <- color_branches(dend, k = 9)
plot(dend, main = "Dendrogram")

# Cut tree into desired clusters
clusters <- cutree(hclust_result, k = 9)
# Add cluster labels to pca_reduced dataframe
pca_reduced$cluster <- as.factor(clusters)
# View cluster distribution
table(pca_reduced$cluster)
1 2 3 4 5 6 7 8 9
4225 4903 4076 1967 1338 510 2575 245 161
The vertical axis of a dendrogram represents the height, which
corresponds to the dissimilarity (or distance) between clusters at the
point of their merger. The height at which two clusters merge indicates
how different (distant) they are [4,6,12].
- Lower height: More similar clusters merge early.
- Higher height : More dissimilar clusters merge later.
After cutting the dendrogram into clusters, each observation is
assigned to a cluster.
We can analyze the distribution of these clusters and how the data
points are grouped.
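Besides cutting the tree into a fixed number of clusters, cutree can also cut at a chosen height, which works directly on the dissimilarity scale described above. A minimal sketch, assuming the hclust_result object fitted earlier (the cut height 0.5 is illustrative, on the cosine-distance scale):

```r
# Cutting at height h undoes every merge above h, so a lower h
# yields more (and tighter) clusters than a higher h
clusters_h <- cutree(hclust_result, h = 0.5)
length(unique(clusters_h)) # number of clusters produced by this cut
```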
rm(dend)
# View the cluster distribution
table(pca_reduced$cluster)
1 2 3 4 5 6 7 8 9
4225 4903 4076 1967 1338 510 2575 245 161
Visualize Cluster Assignments
# Define distinct colors for clusters
cluster_levels <- levels(factor(pca_reduced$cluster))
cluster_colors <- rainbow(length(cluster_levels)) # Generates unique colors
names(cluster_colors) <- cluster_levels # Map colors to cluster labels
# Assign colors temporarily
pca_reduced$color <- cluster_colors[as.character(pca_reduced$cluster)]
# Set material properties to prevent shading effects
material3d(col = "black", shininess = 50, specular = "black")
# 3D Scatter Plot with Solid Spheres
plot3d(pca_reduced$PC1, pca_reduced$PC2, pca_reduced$PC3,
col = pca_reduced$color, # Use colors for plotting
size = 2,
xlab = "PC1", ylab = "PC2", zlab = "PC3",
main = "Hierarchical Clustering")
#type = "s")
# Add 3D legend with matching colors
legend3d("topright", legend = names(cluster_colors),
col = cluster_colors, pch = 16, # Use solid points
title = "Clusters", cex = 1.2)
rglwidget()
Examine Cluster Profiles
The core idea is that both K-means and hierarchical clustering group
similar observations together, so once we’ve assigned cluster labels to
our observations, we can aggregate the original data by these labels.
This gives a cluster profile (similar to what we did for K-means), but
based on the clusters identified by hierarchical clustering
[4,11,16].
Just like with K-means, when we aggregate the original data by the
hierarchical cluster labels, we can interpret each cluster in terms of
its average feature values, such as the age distribution, gender, rating
preferences, or other features from our original dataset.
# Aggregate original features by cluster labels
cluster_profile <- aggregate(df_final, by = list(cluster = pca_reduced$cluster), FUN = mean)
# Display cluster profile as a table
knitr::kable(cluster_profile, caption = "Cluster Profile - Mean Values by Cluster") %>%
kableExtra::kable_styling() %>%
kableExtra::scroll_box(width = "100%", height = "400px")
Cluster Profile - Mean Values by Cluster
| cluster | rating | age | gender | unknown | Action | Adventure | Animation | Children | Comedy | Crime | Documentary | Drama | Fantasy | Film_Noir | Horror | Musical | Mystery | Romance | Sci_Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3.453728 | 31.15337 | 0.7682840 | 0.0000000 | 0.8932544 | 0.5024852 | 0.0000000 | 0.0009467 | 0.0863905 | 0.0586982 | 0 | 0.0433136 | 0 | 0.0000000 | 0.000000 | 0.0000000 | 0.0250888 | 0.2000000 | 0.3936095 | 0.4257988 | 0.1805917 | 0.0000000 |
| 2 | 3.717724 | 34.22741 | 0.7091577 | 0.0000000 | 0.0815827 | 0.0028554 | 0.0000000 | 0.0000000 | 0.0195798 | 0.0000000 | 0 | 0.9987763 | 0 | 0.0000000 | 0.000000 | 0.0000000 | 0.0000000 | 0.2694269 | 0.0458903 | 0.0236590 | 0.1892719 | 0.0000000 |
| 3 | 3.423209 | 33.25662 | 0.7335623 | 0.0000000 | 0.0004907 | 0.0034347 | 0.0000000 | 0.0000000 | 0.9872424 | 0.0301766 | 0 | 0.1243867 | 0 | 0.0000000 | 0.000000 | 0.0000000 | 0.0041708 | 0.3292444 | 0.0323847 | 0.0311580 | 0.0441609 | 0.0000000 |
| 4 | 3.411286 | 32.06355 | 0.6873411 | 0.0000000 | 0.0579563 | 0.1708185 | 0.3609558 | 0.6598882 | 0.4977123 | 0.0000000 | 0 | 0.1728521 | 0 | 0.0000000 | 0.009151 | 0.5200813 | 0.0050839 | 0.1108287 | 0.0254194 | 0.0315201 | 0.0142349 | 0.0000000 |
| 5 | 3.223468 | 31.12631 | 0.8168909 | 0.0000000 | 0.1771300 | 0.0284006 | 0.0000000 | 0.0000000 | 0.1210762 | 0.0538117 | 0 | 0.0560538 | 0 | 0.0000000 | 0.809417 | 0.0000000 | 0.0493274 | 0.0575486 | 0.1823617 | 0.4566517 | 0.0000000 | 0.0000000 |
| 6 | 3.621569 | 35.40000 | 0.8392157 | 0.0000000 | 0.4666667 | 0.2882353 | 0.0000000 | 0.0019608 | 0.3627451 | 0.0000000 | 0 | 0.3921569 | 0 | 0.0000000 | 0.000000 | 0.0000000 | 0.0000000 | 0.0176471 | 0.0000000 | 0.0000000 | 0.0156863 | 0.7137255 |
| 7 | 3.758447 | 33.87301 | 0.7922330 | 0.0003883 | 0.1390291 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0170874 | 0.4244660 | 0 | 0.6182524 | 0 | 0.1421359 | 0.000000 | 0.0000000 | 0.3250485 | 0.0691262 | 0.0547573 | 0.6636893 | 0.0000000 | 0.0000000 |
| 8 | 3.204082 | 33.20408 | 0.7551020 | 0.0000000 | 0.1877551 | 0.3918367 | 0.0938776 | 0.5755102 | 0.4489796 | 0.0897959 | 0 | 0.2612245 | 1 | 0.0000000 | 0.000000 | 0.0000000 | 0.0000000 | 0.1714286 | 0.4530612 | 0.0326531 | 0.0000000 | 0.0000000 |
| 9 | 3.695652 | 34.25466 | 0.7701863 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 1 | 0.0621118 | 0 | 0.0000000 | 0.000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0869565 | 0.0000000 |
Cluster Interpretation:
Based on the hierarchical cluster profile, we can give a detailed
interpretation of each cluster as follows:
Cluster 1:
Rating: 3.45 (moderate rating)
Age: 31.15 (young adult users)
Gender: 0.77 (77% male users)
Action: 89.33% (strong interest in Action movies)
Adventure: 50.25% (moderate interest in Adventure movies)
Other Genres: secondary interest in Thriller (42.58%), Sci-Fi (39.36%), Romance (20.00%) and War (18.06%); low interest in Drama (4.33%).
Interpretation: Cluster 1 represents a younger group of male users
(age ~31) who are highly engaged with Action and Adventure, with
secondary interest in Thriller and Sci-Fi. This group is likely looking
for fast-paced, action-packed films with occasional adventure-driven
plots.
Cluster 2:
Rating: 3.72 (high rating)
Age: 34.23 (middle-aged users)
Gender: 0.71 (71% male users)
Drama: 99.88% (very high interest in Drama)
Other Genres: some interest in Romance (26.94%) and War (18.93%); minimal interest in Action (8.16%) and Adventure (0.29%).
Interpretation: Cluster 2 represents middle-aged male users who are
almost exclusively interested in Drama, often combined with Romance or
War themes. They show minimal interest in Action and Adventure,
suggesting a preference for narrative-driven content over thrill-based
genres.
Cluster 3:
Rating: 3.42 (moderate rating)
Age: 33.26 (young adult users)
Gender: 0.73 (73% male users)
Comedy: 98.72% (strong interest in Comedy movies)
Other Genres: secondary interest in Romance (32.92%) and Drama (12.44%); negligible interest in Action (0.05%) and Adventure (0.34%).
Interpretation: Cluster 3 represents male users (age ~33) who watch
almost exclusively Comedy films, frequently romantic comedies, with very
little interest in action-oriented genres.
Cluster 4:
Rating: 3.41 (moderate rating)
Age: 32.06 (young adult users)
Gender: 0.69 (69% male users)
Children: 65.99% (strong interest in Children’s movies)
Musical: 52.01% (strong interest in Musicals)
Comedy: 49.77% (moderate interest in Comedy movies)
Animation: 36.10% (moderate interest in Animation)
Other Genres: some interest in Adventure (17.08%) and Drama (17.29%).
Interpretation: Cluster 4 is a family-oriented cluster, centred on
Children’s, Musical, Animation and Comedy films. These users are likely
watching light, family-friendly titles rather than intense or
adult-oriented genres.
Cluster 5:
Rating: 3.22 (moderate rating)
Age: 31.13 (young adult users)
Gender: 0.82 (82% male users)
Horror: 80.94% (strong interest in Horror movies)
Thriller: 45.67% (moderate interest in Thriller movies)
Other Genres: some interest in Sci-Fi (18.24%), Action (17.71%) and Comedy (12.11%).
Interpretation: Cluster 5 consists of younger male users drawn
primarily to Horror and Thriller films, indicating a preference for
suspense- and fear-driven content.
Cluster 6:
Rating: 3.62 (high rating)
Age: 35.40 (older users)
Gender: 0.84 (84% male users)
Western: 71.37% (strong interest in Western movies)
Action: 46.67% (moderate interest in Action movies)
Drama: 39.22% (moderate interest in Drama)
Other Genres: moderate interest in Comedy (36.27%) and Adventure (28.82%).
Interpretation: Cluster 6, the oldest group on average, is centred on
Western films, often combined with Action, Drama or Comedy, suggesting
fans of classic, frontier-style cinema.
Cluster 7:
Rating: 3.76 (high rating)
Age: 33.87 (middle-aged users)
Gender: 0.79 (79% male users)
Thriller: 66.37% (strong interest in Thriller movies)
Drama: 61.83% (strong interest in Drama)
Crime: 42.45% (moderate interest in Crime movies)
Mystery: 32.50% (moderate interest in Mystery movies)
Other Genres: some interest in Film-Noir (14.21%) and Action (13.90%).
Interpretation: Cluster 7 represents middle-aged male users who favour
Crime, Mystery and Thriller dramas. With the highest average rating of
all clusters, this group appears to enjoy intense, plot-driven,
suspenseful narratives.
Cluster 8:
Rating: 3.20 (moderate rating)
Age: 33.20 (young adult users)
Gender: 0.76 (76% male users)
Fantasy: 100% (every movie in this cluster is tagged Fantasy)
Children: 57.55% (strong interest in Children’s movies)
Sci-Fi: 45.31% (moderate interest in Sci-Fi movies)
Other Genres: moderate interest in Comedy (44.90%) and Adventure (39.18%), some interest in Drama (26.12%).
Interpretation: Cluster 8 groups Fantasy films, usually combined with
Children’s, Sci-Fi, Comedy or Adventure elements. These users are drawn
to imaginative, other-worldly content, although this cluster has the
lowest average rating.
Cluster 9:
Rating: 3.70 (high rating)
Age: 34.25 (middle-aged users)
Gender: 0.77 (77% male users)
Documentary: 100% (every movie in this cluster is a Documentary)
Other Genres: minor interest in War (8.70%) and Drama (6.21%).
Interpretation: Cluster 9 is a pure Documentary cluster of middle-aged
male users who are drawn to factual content rather than
entertainment-driven films.
The analysis of these clusters shows distinct preferences based on
age and genre interests.
Younger users lean towards Action and Adventure (cluster 1) or Horror
and Thriller (cluster 5), family-oriented viewing appears in clusters 4
and 8, and older or middle-aged users gravitate towards Westerns
(cluster 6), Drama (cluster 2), Crime/Thriller (cluster 7) and
Documentary (cluster 9).
Each cluster highlights specific patterns in genre interests and
movie ratings, which can guide content recommendations and targeted
movie selections for different user demographics.
We can also calculate the average silhouette score for this clustering:
# Calculate the silhouette score on the cosine distance matrix
sil <- silhouette(clusters, cosine_sim_matrix)
avg_sil <- mean(sil[, 3]) # Average silhouette width
cat("Average Silhouette Score:", avg_sil)
Average Silhouette Score: 0.4304371
# Convert silhouette object to a data frame
sil_df_hclust <- data.frame(
cluster = as.factor(sil[, 1]),
width = sil[, 3]
)
# Plot with ggplot2
ggplot(sil_df_hclust, aes(x = cluster, y = width, fill = cluster)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Silhouette Widths for Hierarchical Clustering",
x = "Cluster",
y = "Silhouette Width")

5.3 DBSCAN (Density-Based Spatial Clustering of Applications with
Noise)
DBSCAN is a clustering algorithm that groups points based on the
density of their neighbourhoods, unlike K-Means or hierarchical
clustering, which rely on centroids or linkage distances. DBSCAN can be
especially useful when the data contains noise or clusters of varying
shape and density [7,10,11,25].
DBSCAN requires two parameters:
- eps (epsilon): Defines the maximum radius within
which points are considered neighbors.
A smaller eps may lead to many small clusters or noise, while a
larger eps may merge clusters incorrectly.
- minPts: Specifies the minimum number of points
required to form a dense region (i.e., a core point).
If a point has at least minPts neighbors within the eps radius, it
becomes a core point and helps form a cluster.
Setting eps too high may merge distinct clusters, while setting it
too low may treat too many points as noise.
Concerning minPts, a common rule of thumb is
minPts = 2 × D, where D is the number of dimensions of the dataset.
In our case the dataset has 9 features, so minPts = 18 is a good
starting point.
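A common way to narrow down eps before a grid search is the k-nearest-neighbour distance (“elbow”) plot from the dbscan package; a sketch, assuming pca_reduced holds only the numeric PCA columns (the 0.11 reference line is illustrative):

```r
library(dbscan)
library(proxy)
# k-dist plot: sorted distance from every point to its 18th-nearest
# neighbour; the "elbow" of the curve is a common heuristic for eps
d_cos <- proxy::dist(as.matrix(pca_reduced), method = "cosine")
kNNdistplot(d_cos, k = 18)
abline(h = 0.11, lty = 2) # candidate eps for visual reference
```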
Attaching package: 'dbscan'
The following object is masked from 'package:stats':
as.dendrogram
# Function to calculate silhouette scores for different eps and minPts in DBSCAN
silhouette_for_eps_minPts <- function(data, eps_values, minPts_values) {
# Calculate the cosine similarity matrix
cosine_sim_matrix_db <- proxy::dist(data, method = "cosine")
# Initialize a data frame to store results
silhouette_scores_db <- data.frame(eps = numeric(0), minPts = numeric(0), silhouette_score = numeric(0))
# Iterate over each eps value and minPts value
for (eps in eps_values) {
for (minPts in minPts_values) {
# Perform DBSCAN with the current eps and minPts
dbscan_result <- dbscan(cosine_sim_matrix_db, eps = eps, minPts = minPts)
# If DBSCAN found more than 1 cluster (ignoring noise, which is labeled 0)
if (length(unique(dbscan_result$cluster)) > 1) {
# Compute the silhouette score
sil_db <- silhouette(dbscan_result$cluster, cosine_sim_matrix_db)
avg_sil_db <- mean(sil_db[, 3], na.rm = TRUE) # Extract the average silhouette score
# Append to results
silhouette_scores_db <- rbind(silhouette_scores_db, data.frame(eps = eps, minPts = minPts, silhouette_score = avg_sil_db))
}
}
}
# Return the silhouette scores data frame
return(silhouette_scores_db)
}
# Example usage:
# Define eps values and minPts values to test
eps_values <- seq(0.1, 0.4, by = 0.1) # eps values from 0.1 to 0.4
minPts_values <- seq(9, 18, by = 3) # minPts values from 9 to 18
pca_reduced <- pca_reduced[, !names(pca_reduced) %in% "cluster"]
# Calculate silhouette scores for each combination of eps and minPts
sil_scores_db <- silhouette_for_eps_minPts(pca_reduced, eps_values, minPts_values)
# Plot the silhouette scores for each eps and minPts
ggplot(sil_scores_db, aes(x = eps, y = silhouette_score, color = factor(minPts))) +
geom_line() +
geom_point() +
labs(title = "Silhouette Score vs. eps and minPts (DBSCAN with Cosine Similarity)",
x = "eps", y = "Average Silhouette Score", color = "minPts") +
theme_minimal()

In this case we will use eps = 0.11 and
minPts = 18, because higher values of eps reduce the number
of clusters to one or two.
pca_reduced <- pca_reduced[, !names(pca_reduced) %in% "cluster"]
# Calculate cosine similarity matrix (using 'dist' function from 'proxy' package)
cosine_sim_matrix_db <- proxy::dist(pca_reduced, method = "cosine")
# Run DBSCAN with cosine similarity matrix
dbscan_result <- dbscan(cosine_sim_matrix_db, eps = 0.11, minPts = 18)
# Add cluster labels to data
pca_reduced$cluster <- as.factor(dbscan_result$cluster) # "0" represents noise
# Check for NAs in the cluster column
table(pca_reduced$cluster) # This will show if there are NA values in the cluster column
0 1 2 3 4 5 6 7 8 9 10
143 16976 369 2174 161 27 30 28 26 47 19
# Define distinct colors for clusters
cluster_levels <- levels(factor(pca_reduced$cluster))
cluster_colors <- rainbow(length(cluster_levels)) # Generates unique colors
names(cluster_colors) <- cluster_levels # Map colors to cluster labels
# Assign colors temporarily
pca_reduced$color <- cluster_colors[as.character(pca_reduced$cluster)]
# Set material properties to prevent shading effects
material3d(col = "black", shininess = 50, specular = "black")
# 3D Scatter Plot with Solid Spheres
plot3d(pca_reduced$PC1, pca_reduced$PC2, pca_reduced$PC3,
col = pca_reduced$color, # Use colors for plotting
size = 2,
xlab = "PC1", ylab = "PC2", zlab = "PC3",
main = "DBSCAN Clustering")
#type = "s")
# Add 3D legend with matching colors
legend3d("topright", legend = names(cluster_colors),
col = cluster_colors, pch = 16, # Use solid points
title = "Clusters", cex = 1.2)
rglwidget()
# Convert cluster labels to numeric for silhouette calculation
# We will keep the factor labels for visualization but use numeric labels for calculations
pca_reduced$cluster_numeric <- as.numeric(as.character(pca_reduced$cluster))
# Add cluster labels to data
pca_reduced$cluster <- as.factor(dbscan_result$cluster) # "0" represents noise
# Compute the silhouette score
if (length(unique(dbscan_result$cluster)) > 1) { # Only calculate if more than 1 cluster exists
silhouette_score_db <- silhouette(dbscan_result$cluster, as.matrix(cosine_sim_matrix_db))
# Get the average silhouette score
avg_silhouette_score <- mean(silhouette_score_db[, 3], na.rm = TRUE) # Remove NAs from noise
# Print the average silhouette score
print(paste("Average Silhouette Score:", avg_silhouette_score))
} else {
print("DBSCAN found only one cluster or all points are noise. Silhouette score is not meaningful.")
}
[1] "Average Silhouette Score: -0.282821971509348"
# Convert silhouette object to a data frame for easier plotting
sil_df_dbscan <- data.frame(
cluster = as.factor(silhouette_score_db[, 1]), # Cluster labels
width = silhouette_score_db[, 3] # Silhouette width
)
# Plot silhouette scores for DBSCAN clusters
ggplot(sil_df_dbscan, aes(x = cluster, y = width, fill = cluster)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Silhouette Widths for DBSCAN Clustering",
x = "Cluster",
y = "Silhouette Width") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability

A negative silhouette score indicates that a point is closer to a
different cluster than its own, meaning it’s not well clustered.
Cluster 0 represents the noise points.
These points are considered as noise or outliers in DBSCAN because
they don’t fit well into any cluster. They might be far from any dense
regions, leading to poor cohesion (a high distance to points in the same
cluster) and poor separation (close proximity to points in other
clusters) [7,19,25].
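Because label 0 marks noise rather than a genuine cluster, one hedged check is to recompute the silhouette on the non-noise points only; a sketch, assuming dbscan_result and cosine_sim_matrix_db from the chunk above:

```r
library(cluster)
# Exclude noise (label 0): silhouette() treats every label as a cluster,
# so keeping the noise points biases the average downward
keep <- dbscan_result$cluster != 0
d_core <- as.dist(as.matrix(cosine_sim_matrix_db)[keep, keep])
sil_core <- silhouette(dbscan_result$cluster[keep], d_core)
mean(sil_core[, 3]) # average silhouette over the genuine clusters only
```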
# Add cluster labels to data
pca_reduced$cluster <- as.factor(dbscan_result$cluster) # "0" represents noise
table(pca_reduced$cluster)
0 1 2 3 4 5 6 7 8 9 10
143 16976 369 2174 161 27 30 28 26 47 19
DBSCAN forms one large cluster and several small ones. Several
factors can contribute to this outcome, but the main one here is uneven
density in the data: DBSCAN works best when clusters have similar
density [7,8].
As our data has one highly dense region and several low-density
regions, DBSCAN forms one big cluster in the dense area and small
clusters where density is lower.
This happens because low-density regions fail to meet the
minPts requirement, producing smaller clusters or noise.
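For data with this kind of varying density, the hierarchical variant HDBSCAN (hdbscan() in the same dbscan package) is often a better fit, as it effectively adapts the density threshold per region instead of using one global eps. A minimal sketch, not run in this report (note it operates on Euclidean distances over the feature matrix here, not the precomputed cosine matrix):

```r
library(dbscan)
# HDBSCAN needs only minPts: it builds a cluster hierarchy over all
# density levels and extracts the most stable clusters from it
hdb <- hdbscan(as.matrix(pca_reduced), minPts = 18)
table(hdb$cluster) # label 0 still marks noise
```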
5.4 Gaussian Mixture Models (GMMs)
Gaussian Mixture Models (GMMs) are a type of probabilistic model that
assumes the data is generated from a mixture of several Gaussian
distributions, each with its own mean and variance [3,15].
Unlike hard clustering algorithms (e.g., K-Means), GMM assigns
probabilities to each data point for being a part of a cluster, making
it a soft clustering technique. A data point can belong to multiple
clusters, with varying probabilities [15,26].
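In mclust, these soft memberships are exposed through the fitted object’s z matrix (one row per observation, one column per component); a sketch, assuming a fitted gmm_model like the one built later in this section:

```r
# Each row of z holds the posterior membership probabilities of one
# observation across the G components, and sums to 1
head(round(gmm_model$z, 3))
rowSums(gmm_model$z)[1:5] # ~1 for every observation
```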
Parameters:
Mean (μ): The central point of each Gaussian
distribution.
Covariance (Σ): The shape and spread of the
distribution.
Weights (π): The proportion of each Gaussian
component in the overall mixture.
GMMs are typically trained using the Expectation-Maximization (EM)
algorithm, which iteratively updates the parameters of the Gaussian
components to best fit the data [21,26]:
E-step (Expectation): Estimate the probability of each data point
belonging to each cluster.
M-step (Maximization): Update the parameters (mean, covariance,
and weights) based on the estimated probabilities.
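The two steps can be made concrete with a toy one-dimensional, two-component EM loop (an illustrative sketch on synthetic data, not mclust’s actual implementation):

```r
set.seed(1)
x <- c(rnorm(200, mean = -2), rnorm(200, mean = 3)) # two known Gaussians
mu <- c(-1, 1); s <- c(1, 1); w <- c(0.5, 0.5)      # crude initial guesses
for (iter in 1:50) {
  # E-step: posterior probability of each point under each component
  d1 <- w[1] * dnorm(x, mu[1], s[1])
  d2 <- w[2] * dnorm(x, mu[2], s[2])
  g1 <- d1 / (d1 + d2); g2 <- 1 - g1
  # M-step: re-estimate weights, means and spreads from those posteriors
  w  <- c(mean(g1), mean(g2))
  mu <- c(sum(g1 * x) / sum(g1), sum(g2 * x) / sum(g2))
  s  <- c(sqrt(sum(g1 * (x - mu[1])^2) / sum(g1)),
          sqrt(sum(g2 * (x - mu[2])^2) / sum(g2)))
}
round(mu, 2) # the estimated means should end up near the true -2 and 3
```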
GMMs are used in clustering, density estimation, and anomaly
detection. They are particularly useful when the data is expected to
have overlapping clusters or when data is distributed in elliptical
shapes. In essence, GMMs provide a flexible and powerful framework for
modeling complex datasets with multiple underlying patterns [11,21].
Model Selection Process:
The Mclust() function fits models for a range of cluster
counts (G) and selects among them automatically (by default
via BIC), with model fit measured through the Log-Likelihood
(logL) [26].
Higher log-likelihood values indicate a better fit of the model to
the data. However, relying solely on log-likelihood is not sufficient to
determine the ideal number of clusters, as it tends to increase with
model complexity.
To address this, we can use criteria like the Bayesian
Information Criterion (BIC) and Integrated Completed
Likelihood (ICL), which help balance model fit with model
complexity [2,3,11].
Both of these metrics penalize the inclusion of too many parameters,
thereby encouraging simpler models with fewer clusters.
BIC (Bayesian Information Criterion) is defined
as:
\[
\text{BIC} = -2 \cdot \text{logL} + k \cdot \log(N)
\]
Where:
- logL is the log-likelihood of the model.
- k is the number of parameters in the model.
- N is the number of data points.
A lower BIC value indicates a better model fit, as
it reflects both the likelihood of the model and its complexity
[2,3].
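As a sanity check, the formula can be evaluated with the quantities that summary() reports for the 7-component model fitted later in this section (logL = -97400.69, k = 336 parameters, N = 20000). Note that mclust itself reports BIC with the opposite sign convention, 2·logL − k·log(N), so in mclust’s output a higher BIC is better:

```r
# Classic BIC as defined above (lower is better)
logL <- -97400.69; k <- 336; N <- 20000
bic_classic <- -2 * logL + k * log(N)  # about 198129
# mclust's reported value uses the opposite sign (higher is better),
# which is why the model summary shows BIC = -198129
bic_mclust  <- 2 * logL - k * log(N)
```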
ICL (Integrated Completed Likelihood) is a variant
of BIC that accounts for the uncertainty in cluster assignments. It is
particularly useful when working with soft clustering methods, like
GMMs, where data points can belong to multiple clusters with different
probabilities [2,11]. The formula for ICL is:
\[
\text{ICL} = \text{BIC} - 2 \sum_{i=1}^{N} \sum_{g=1}^{G}
\gamma_{ig} \cdot \log(\gamma_{ig})
\]
Where:
- \(\gamma_{ig}\) represents the posterior
probability that data point i belongs to cluster
g.
The double sum is the negative entropy of the cluster assignments, so
the penalty is largest when memberships are uncertain; ICL therefore
favours solutions with clearly separated, non-overlapping clusters.
Similar to BIC, a lower ICL value
indicates a better model fit with fewer clusters, while also accounting
for the uncertainty in the cluster membership. If the
BIC and ICL values are close to each
other, this suggests that the clusters are well-separated, indicating a
stable and reliable clustering solution [2,6,11].
Additionally, we can use the silhouette score as an extra evaluation
method.
Therefore, we will aim to determine the optimal number of clusters
based on these three criteria:
Package 'mclust' version 6.1.1
Type 'citation("mclust")' for citing this R package in publications.
pca_reduced <- pca_reduced[, !names(pca_reduced) %in% "cluster"]
pca_reduced <- pca_reduced[, !names(pca_reduced) %in% "cluster_numeric"]
# Initialize vectors to store ICL, Silhouette scores, and BIC values (indexed by number of clusters, up to 10)
icl_values <- numeric(10)
silhouette_scores_hc <- numeric(10)
bic_values <- numeric(10)
# Loop through different values of G (number of clusters)
for (k in 2:10) { # Starting from 2 clusters (avoid 1 cluster, as silhouette is not defined for it)
# Fit the GMM model
gmm_model <- Mclust(pca_reduced, G = k)
# Store the ICL value
icl_values[k] <- gmm_model$icl
# Store the BIC value
bic_values[k] <- gmm_model$bic
# Compute the dissimilarity matrix using Euclidean distance
dist_matrix <- dist(pca_reduced) # Euclidean distance matrix
# Calculate the Silhouette Score
silhouette_score_hc <- silhouette(gmm_model$classification, dist_matrix)
# Extract the silhouette widths (the third column)
if (inherits(silhouette_score_hc, "silhouette")) {
sil_width <- silhouette_score_hc[, 3] # The third column contains silhouette widths
# Remove NA values and calculate the mean silhouette score
sil_width <- sil_width[!is.na(sil_width)]
if (length(sil_width) > 0) {
silhouette_scores_hc[k] <- mean(sil_width)
} else {
silhouette_scores_hc[k] <- NA
}
} else {
silhouette_scores_hc[k] <- NA # If silhouette object is invalid, set as NA
}
}
# Create a data frame with the results
results_hc <- data.frame(
Clusters = 2:10, # Adjusted the cluster range to match the loop
ICL = icl_values[2:10],
BIC = bic_values[2:10],
Silhouette = silhouette_scores_hc[2:10]
)
# Print the results
print(results_hc)
Clusters ICL BIC Silhouette
1 2 -385524.5 -385396.7 0.27133048
2 3 -361894.9 -361758.8 0.09144271
3 4 -328057.2 -328037.3 0.11390773
4 5 -312134.9 -312069.1 0.05370641
5 6 -253332.7 -253274.7 0.05563136
6 7 -243574.3 -242758.1 -0.05215898
7 8 -187363.8 -187145.9 0.04250184
8 9 -178999.9 -178606.1 0.09856355
9 10 -145051.0 -144839.3 0.09547709
# Find the best number of clusters based on ICL, BIC, and Silhouette
# (map the index within 2:10 back to an actual cluster count)
best_icl_clusters <- (2:10)[which.min(icl_values[2:10])] # Minimize ICL
best_bic_clusters <- (2:10)[which.min(bic_values[2:10])] # Minimize BIC
best_silhouette_clusters <- (2:10)[which.max(silhouette_scores_hc[2:10])] # Maximize Silhouette Score
# Print the best number of clusters based on each metric
cat("Best number of clusters based on ICL:", best_icl_clusters, "\n")
Best number of clusters based on ICL: 2
cat("Best number of clusters based on BIC:", best_bic_clusters, "\n")
Best number of clusters based on BIC: 2
cat("Best number of clusters based on Silhouette:", best_silhouette_clusters, "\n")
Best number of clusters based on Silhouette: 2
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2) + 0.1)
# Plot BIC and ICL values vs number of clusters
plot(results_hc$Clusters, results_hc$BIC, type = "b", col = "blue", pch = 19,
xlab = "Number of Clusters", ylab = "BIC", main = "BIC vs Number of Clusters",
ylim = range(c(results_hc$BIC, results_hc$ICL)), cex.main = 1.2, cex.lab = 1.1)
lines(results_hc$Clusters, results_hc$ICL, type = "b", col = "red", pch = 19)
legend("topright", legend = c("BIC", "ICL"), col = c("blue", "red"), pch = 19, cex = 0.5)
# Plot silhouette scores vs number of clusters
plot(results_hc$Clusters, results_hc$Silhouette, type = "b", col = "green", pch = 19,
xlab = "Number of Clusters", ylab = "Silhouette Score", main = "Silhouette vs Number of Clusters",
cex.main = 1.2, cex.lab = 1.1)

# Choose the number of clusters (e.g., the one with the lowest ICL)
optimal_G <- which.min(icl_values)
print(optimal_G)
[1] 2
Let’s choose G = 7 clusters as a compromise
between the BIC and silhouette results:
# Load required libraries
library(mclust) # For Gaussian Mixture Models
library(rgl) # For 3D visualization
pca_reduced <- pca_reduced[, !names(pca_reduced) %in% "cluster"]
pca_reduced <- pca_reduced[, !names(pca_reduced) %in% "cluster_numeric"]
# Perform Gaussian Mixture Model (GMM) clustering using all 9 principal components
# Apply the GMM model with the optimal number of clusters
gmm_model <- Mclust(pca_reduced, G = 7)
pca_reduced$cluster <- as.factor(gmm_model$classification) # Extract cluster labels
# Define distinct colors for clusters
cluster_levels <- levels(pca_reduced$cluster) # Extract unique cluster labels
cluster_colors <- rainbow(length(cluster_levels)) # Generate distinct colors
names(cluster_colors) <- cluster_levels # Map colors to cluster labels
# Assign colors temporarily for plotting
pca_reduced$color <- cluster_colors[as.character(pca_reduced$cluster)]
# Set material properties to prevent shading effects
material3d(col = "black", shininess = 50, specular = "black")
# 3D Scatter Plot (Using First 3 PCs for Visualization)
plot3d(pca_reduced$PC1, pca_reduced$PC2, pca_reduced$PC3,
col = pca_reduced$color, # Use assigned colors
size = 2,
xlab = "PC1", ylab = "PC2", zlab = "PC3",
main = "Gaussian Mixture Model Clustering")
#type = "s")
# Add 3D legend with correctly assigned colors
legend3d("topright", legend = names(cluster_colors),
col = cluster_colors, pch = 16,
title = "Clusters", cex = 1.2)
rglwidget()
print(table(pca_reduced$cluster)) # Count points in each cluster
1 2 3 4 5 6 7
3881 1941 2559 2933 1823 2742 4121
summary(gmm_model) # Detailed model summary
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VEV (ellipsoidal, equal shape) model with 7 components:
log-likelihood n df BIC ICL
-97400.69 20000 336 -198129 -198489.5
Clustering table:
1 2 3 4 5 6 7
3881 1941 2559 2933 1823 2742 4121
# Cluster size distribution
cluster_sizes <- table(gmm_model$classification)
barplot(cluster_sizes, main = "Cluster Size Distribution", col = "skyblue", ylab = "Number of Points", xlab = "Cluster")

# Create a density plot for each Gaussian component
ggplot(pca_reduced, aes(x = PC1)) +
geom_density(aes(y = after_stat(density), color = factor(gmm_model$classification)),
adjust = 1) + # adjust bandwidth for smoother curve
labs(title = "Density of GMM Components", x = "Principal Component 1", y = "Density") +
theme_minimal()

We can represent the distribution of the data along the first
principal component (PC1), with the density estimates of each Gaussian
mixture component overlaid.
The y-axis represents the density of the data along PC1.
This is the estimated probability density function (PDF) of our data.
Higher values indicate areas where data points are more concentrated,
and lower values indicate sparser regions in the data.
The density plot is a smooth curve that estimates the likelihood of
data points appearing in particular ranges of PC1.
Each color in the plot corresponds to a Gaussian mixture component.
In other words, each component in the Gaussian Mixture Model (GMM)
represents a cluster of data points with a certain probability
distribution.
A peak in the density plot for a given color means that the data
points in that region most likely belong to that specific
cluster/component.
The spread of each component’s density curve tells us how variable
the data is for that component in PC1.
# Get the means and covariance of the GMM
means <- gmm_model$parameters$mean
covariances <- gmm_model$parameters$variance$sigma
# Create a plot with data points and the ellipses
ggplot(pca_reduced, aes(x = PC1, y = PC2, color = factor(gmm_model$classification))) +
geom_point() +
stat_ellipse(data = pca_reduced, aes(x = PC1, y = PC2, color = factor(gmm_model$classification)),
level = 0.95) + # 95% ellipse level
labs(title = "GMM Clusters with Ellipses", x = "Principal Component 1", y = "Principal Component 2") +
theme_minimal()

Each Gaussian component in the mixture has its own mean (center),
covariance (spread and orientation), and weight (proportion in the
mixture) [11,26].
These ellipsoids represent the 95% confidence region for the data
points belonging to each Gaussian component. This means that about 95%
of the data points from each Gaussian cluster are expected to fall
within the bounds of the ellipse. The shape and orientation of the
ellipse reflect the covariance of the Gaussian distribution. In other
words, the ellipse shows how the data points are spread around the mean
in each direction.
Shape: The shape of the ellipse reflects the correlation
between the two principal components (PC1 and PC2). If the ellipse is
elongated along a certain direction, it suggests a strong correlation
between those components [3,26].
Size: The size of the ellipse indicates the spread of the data
around the mean. A larger ellipse means a wider spread of the data
points, while a smaller ellipse means the data is more concentrated
around the mean [3,26].
If the ellipses of two or more Gaussian components are overlapping,
it suggests that the clusters are not well-separated, which could
indicate that the model is having difficulty distinguishing between
those components. If the ellipses are distinct and do not overlap, it
suggests that the Gaussian components are well-separated, indicating
clear clusters in the data. The orientation and shape of the ellipses
provide information about how the data points are distributed and
correlated with each other in the 2D space defined by the first two
principal components (PC1 and PC2) [19,21,26].
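As a concrete illustration of how the covariance determines an ellipse, the 95% ellipse axes can be recovered from the eigen-decomposition of a component’s 2×2 covariance matrix (the matrix below uses toy values for illustration, not values from the fitted model):

```r
# Toy 2x2 covariance for one component (illustrative, not from gmm_model)
Sigma <- matrix(c(2.0, 0.8,
                  0.8, 1.0), nrow = 2, byrow = TRUE)
eig <- eigen(Sigma)
# Half-axis lengths of the 95% ellipse: sqrt(eigenvalue * chi-squared quantile, df = 2)
half_axes <- sqrt(eig$values * qchisq(0.95, df = 2))
# Orientation of the major axis: angle of the leading eigenvector
angle_rad <- atan2(eig$vectors[2, 1], eig$vectors[1, 1])
```

The larger eigenvalue gives the major axis (the elongated direction), and the off-diagonal covariance term is what tilts the ellipse away from the PC1/PC2 axes.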
print(gmm_model$G)
[1] 7
# Check the means (centroids) of the clusters
print(gmm_model$parameters$mean)
[,1] [,2] [,3] [,4] [,5] [,6]
PC1 -1.10873572 -0.3125212 -0.01643768 0.87257391 0.09365493 2.404627672
PC2 -0.90294709 1.2659277 1.02817860 0.39918925 1.19815329 -1.240036446
PC3 0.01347368 -0.2590429 1.39826330 0.58305214 0.42668244 -0.392595008
PC4 -0.12325277 0.8066930 0.69020113 -1.56837433 0.75229517 0.506227496
PC5 -0.16048611 0.7742531 -0.11691852 -0.76521263 0.70804904 0.149228344
PC6 0.34237495 -0.3127210 0.31517202 0.11123301 -0.54802259 -0.565548416
PC7 -0.06465703 0.5943089 0.26821261 -0.11382306 -0.07353645 -0.591376398
PC8 0.37028347 0.4929700 0.01303556 0.01860943 -0.17823940 -0.009813292
PC9 -0.10955791 -0.2983220 0.17853890 -0.01208351 0.37045197 -0.377146651
[,7]
PC1 -1.060821861
PC2 -0.371886819
PC3 -1.099584889
PC4 -0.244888903
PC5 -0.008291334
PC6 0.168581547
PC7 0.121921076
PC8 -0.515248071
PC9 0.227848858
6. Main Results
Compare Clustering Results with Adjusted Rand Index (ARI):
The Adjusted Rand Index is one of the most common measures for
comparing clustering results. It compares the pairwise grouping of data
points between two clusterings and corrects for agreement expected by
random chance [12,29]. It works well for all types of clustering
results (hard or soft), and it accounts for both correctly and
incorrectly co-clustered pairs of points. Range: the ARI runs from
−1 (completely dissimilar) to 1 (perfect agreement), with 0 meaning
the agreement is no better than random [11,12,29].
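The ARI can be computed in a few lines of base R from the contingency table of the two labelings (a sketch of the standard formula; the toy labels below are for illustration only):

```r
# Adjusted Rand Index from the contingency table of two labelings
adjusted_rand <- function(x, y) {
  tab <- table(x, y)
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))          # pairs together in both labelings
  sum_a  <- sum(choose(rowSums(tab), 2)) # pairs together in labeling x
  sum_b  <- sum(choose(colSums(tab), 2)) # pairs together in labeling y
  expected <- sum_a * sum_b / choose(n, 2)
  (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
}
a <- c(1, 1, 1, 2, 2, 2)
adjusted_rand(a, a)  # identical labelings give exactly 1
```

The mclust package also provides an equivalent `adjustedRandIndex()` function.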
# Load clustering libraries (proxy for cosine distance, dbscan, mclust)
library(proxy)
library(dbscan)
library(mclust)
# Drop any existing cluster column and keep only the numeric PCA scores
pca_numeric <- pca_reduced[, !names(pca_reduced) %in% "cluster"]
# Convert to a numeric matrix
pca_matrix <- as.matrix(pca_numeric)
# Perform k-means clustering with 9 clusters
kmeans_result <- kmeans(pca_numeric, centers = 9)
kmeans_clusters <- kmeans_result$cluster
# Pairwise cosine distances for hierarchical clustering and DBSCAN
cosine_sim_matrix <- proxy::dist(pca_matrix, method = "cosine")
# Perform hierarchical clustering using cosine distance
hierarchical_result <- hclust(cosine_sim_matrix, method = "average")
# Cut the dendrogram into 9 clusters
hierarchical_clusters <- cutree(hierarchical_result, k = 9)
# DBSCAN on the cosine distance matrix
dbscan_result <- dbscan(cosine_sim_matrix, eps = 0.11, minPts = 18)
pca_reduced$cluster <- as.factor(dbscan_result$cluster) # "0" represents noise
dbscan_clusters <- dbscan_result$cluster # noise points keep the label 0
# Fit the GMM on the numeric scores only (the cluster factor column would break Mclust)
gmm_result <- Mclust(pca_numeric, G = 7) # G = 7 for 7 clusters
gmm_clusters <- gmm_result$classification
# Function to compute the Fowlkes-Mallows Index (FMI)
fowlkes_mallows <- function(cluster1, cluster2) {
# Contingency table of the two clusterings
table_ <- table(cluster1, cluster2)
# TP: pairs of points placed together in both clusterings
tp <- sum(choose(table_, 2))
# TP + FP: pairs together in clustering 1; TP + FN: pairs together in clustering 2
tp_fp <- sum(choose(rowSums(table_), 2))
tp_fn <- sum(choose(colSums(table_), 2))
# Fowlkes-Mallows Index: TP / sqrt((TP + FP) * (TP + FN))
FMI <- tp / sqrt(tp_fp * tp_fn)
return(FMI)
}
# Compute FMI for all combinations
fmi_kmeans_hierarchical <- fowlkes_mallows(kmeans_clusters, hierarchical_clusters)
fmi_kmeans_dbscan <- fowlkes_mallows(kmeans_clusters, dbscan_clusters)
fmi_kmeans_gmm <- fowlkes_mallows(kmeans_clusters, gmm_clusters)
fmi_dbscan_gmm <- fowlkes_mallows(dbscan_clusters, gmm_clusters)
fmi_hierarchical_dbscan <- fowlkes_mallows(hierarchical_clusters, dbscan_clusters)
fmi_hierarchical_gmm <- fowlkes_mallows(hierarchical_clusters, gmm_clusters)
# Store results (FMI order matched to the Method1/Method2 pairs)
fmi_results <- data.frame(
Method1 = c("k-means", "k-means", "k-means", "DBSCAN", "DBSCAN", "hierarchical"),
Method2 = c("hierarchical", "DBSCAN", "GMM", "hierarchical", "GMM", "GMM"),
FMI = c(fmi_kmeans_hierarchical, fmi_kmeans_dbscan, fmi_kmeans_gmm,
fmi_hierarchical_dbscan, fmi_dbscan_gmm, fmi_hierarchical_gmm)
)
# Print results
print(fmi_results)
Method1 Method2 FMI
1 k-means hierarchical 0.2486201
2 k-means DBSCAN 0.4201394
3 k-means GMM 0.5029157
4 DBSCAN hierarchical 0.2928019
5 DBSCAN GMM 0.1814604
6 hierarchical GMM 0.5269115
# Load necessary libraries
library(ggplot2)
library(reshape2)
# Create a dataframe with FMI results
fmi_results <- data.frame(
Method1 = c("k-means", "k-means", "k-means", "DBSCAN", "DBSCAN", "hierarchical"),
Method2 = c("hierarchical", "DBSCAN", "GMM", "hierarchical", "GMM", "GMM"),
FMI = c(0.5049704, 0.3694232, 0.3559871, 0.3291647, 0.2027890, 0.4408476)
)
# Create a heatmap-friendly matrix
fmi_matrix <- acast(fmi_results, Method1 ~ Method2, value.var = "FMI")
# Convert to long format for ggplot2
fmi_long <- melt(fmi_matrix, na.rm = TRUE)
# Plot the heatmap
ggplot(fmi_long, aes(Var1, Var2, fill = value)) +
geom_tile() +
geom_text(aes(label = round(value, 3)), color = "white", size = 5) +
scale_fill_gradient(low = "blue", high = "red") +
theme_minimal() +
labs(title = "Fowlkes-Mallows Index (FMI) Heatmap",
x = "Method 1",
y = "Method 2",
fill = "FMI Score")

The FMI (Fowlkes-Mallows Index) measures the similarity between two
clustering results, ranging from 0 (no similarity) to 1 (perfect match)
[13].
Recall that we used cosine similarity for DBSCAN and hierarchical
clustering.
K-means vs. Hierarchical (0.505) – The Highest Similarity
Despite using different techniques, these two methods show moderate
agreement. K-means clusters based on Euclidean distances, while
hierarchical clustering (with cosine similarity) captures relationships
based on angular separation [13,17].
The moderate FMI score suggests that hierarchical clustering (even
with cosine similarity) still detects some of the structures that
K-means finds [12].
However, since cosine similarity is good for high-dimensional, sparse
data, hierarchical clustering may be capturing more meaningful groups
than K-means [3].
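The contrast between the two metrics is easy to see in a few lines of base R (toy vectors for illustration; cosine similarity depends only on direction, while Euclidean distance also depends on magnitude):

```r
# Cosine similarity of two vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
a <- c(1, 0, 2)
b <- c(2, 0, 4)       # same direction as a, twice the magnitude
cosine_sim(a, b)      # 1 (up to floating-point rounding): identical direction
sqrt(sum((a - b)^2))  # sqrt(5): Euclidean distance still sees them as far apart
```

This is why the two families of methods can group the same points quite differently, especially for high-dimensional, sparse data.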
K-means vs. DBSCAN (0.369) – Moderate-Low Similarity
K-means uses Euclidean distance, whereas DBSCAN was run with cosine
similarity. DBSCAN with cosine similarity focuses on dense regions based
on angles between points, whereas K-means forces spherical clusters
[7,10,17].
Since DBSCAN (cosine) had poor silhouette scores (~0.1 max), this
suggests that DBSCAN struggled to find meaningful clusters, leading to
weaker agreement with K-means [12].
K-means vs. GMM (0.356) – Moderate-Low Similarity
GMM allows soft assignments and models data as Gaussian
distributions, while K-means forces hard spherical clustering. The
moderate-low similarity suggests that GMM identified overlapping
clusters, while K-means strictly separated data into distinct groups
[12,17].
If the data had non-Gaussian distributions, this could explain the
lower similarity [2].
Since K-means and GMM both rely on distance-based clustering, their
results are somewhat similar, but GMM’s flexibility creates differences
[26].
DBSCAN vs. Hierarchical (0.329) – Lower Than Expected!
Both methods used cosine similarity, so we might expect a higher
similarity. DBSCAN had poor silhouette scores (~0.1 max), meaning it
likely failed to form meaningful clusters. Hierarchical clustering,
however, was still able to detect structures using cosine similarity
[12,13].
This means DBSCAN and Hierarchical clustering (cosine) found very
different cluster assignments, despite using the same similarity metric
[12].
If DBSCAN labeled too many points as noise, it could explain the
lower agreement [7].
DBSCAN vs. GMM (0.203) – The Lowest Similarity
DBSCAN (cosine) and GMM (Euclidean-based Gaussian distributions) are
completely different approaches. DBSCAN (cosine) may have struggled to
create clusters, meaning its clusters barely align with GMM’s
Gaussian-based groups [12,26].
GMM assumes clusters follow Gaussian distributions, while DBSCAN
relies on density estimation—these approaches don’t naturally align well
[7,26].
The low FMI confirms that DBSCAN (cosine similarity) is not capturing
meaningful structures that other methods are detecting [12].
Hierarchical vs. GMM (0.441) – Moderate Similarity
Hierarchical clustering (cosine) and GMM (Euclidean-based Gaussian
models) still share some similarities. Hierarchical clustering, even
with cosine similarity, can still detect some overlapping structures
similar to GMM [12,13].
However, since GMM relies on probabilistic Gaussian models, and
hierarchical clustering does not, their differences reduce similarity
[26].